R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] htmlwidgets_1.6.4 compiler_4.3.2 fastmap_1.1.1 cli_3.6.2
[5] tools_4.3.2 htmltools_0.5.7 rstudioapi_0.15.0 yaml_2.3.8
[9] rmarkdown_2.25 knitr_1.45 jsonlite_1.8.8 xfun_0.41
[13] digest_0.6.33 rlang_1.1.2 evaluate_0.23
Q1. Git/GitHub
No handwritten homework reports are accepted for this course. We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework.
Apply for the Student Developer Pack at GitHub using your UCLA email. You’ll get a GitHub Pro account for free (unlimited public and private repositories).
Create a private repository biostat-203b-2024-winter and add Hua-Zhou and the TA team (Tomoki-Okuno for Lec 1; jonathanhori and jasenzhang1 for Lec 80) as your collaborators with write permission.
Top directories of the repository should be hw1, hw2, … Maintain two branches, main and develop. The develop branch will be your main playground, the place where you develop solutions (code) to homework problems and write up reports. The main branch will be your presentation area. Submit your homework files (Quarto file qmd, html file converted by Quarto, and all code and extra data sets needed to reproduce results) in the main branch.
After each homework due date, the course reader and instructor will check out your main branch for grading. Tag each of your homework submissions with tag names hw1, hw2, … Tagging time will be used as your submission time. That means if you tag your hw1 submission after the deadline, penalty points will be deducted for late submission.
After this course, you can make this repository public and use it to demonstrate your skill sets on the job market.
This exercise (and later in this course) uses the MIMIC-IV data v2.2, a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at https://mimic.mit.edu/docs/gettingstarted/ to (1) complete the CITI Data or Specimens Only Research course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. You must complete Q2 before working on the remaining questions. (Hint: The CITI training takes a few hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.)
Make the MIMIC v2.2 data available at location ~/mimic.
ls -l ~/mimic/
Refer to the documentation https://physionet.org/content/mimiciv/2.2/ for details of data files. Please do not put these data files into Git; they are big. Do not copy them into your directory. Do not decompress the gz data files. These create unnecessarily big files and are not big-data-friendly practices. Read from the data folder ~/mimic directly in the following exercises.
Use Bash commands to answer the following questions.
Answer: I created a symbolic link mimic to my MIMIC data folder. Here is the output of ls -l ~/mimic/:
ls -l ~/mimic/
total 48
-rw-rw-r--@ 1 zhangjiyin staff 13332 Jan 5 2023 CHANGELOG.txt
-rw-rw-r--@ 1 zhangjiyin staff 2518 Jan 5 2023 LICENSE.txt
-rw-rw-r--@ 1 zhangjiyin staff 2884 Jan 6 2023 SHA256SUMS.txt
drwxr-xr-x@ 24 zhangjiyin staff 768 Jan 5 23:41 hosp
drwxr-xr-x@ 11 zhangjiyin staff 352 Jan 5 23:41 icu
lrwxr-xr-x 1 zhangjiyin staff 61 Jan 24 22:46 mimic-iv-2.2 -> /Users/zhangjiyin/Desktop/ucla/23-24/winter/203B/mimic-iv-2.2
Display the contents in the folders hosp and icu using Bash command ls -l. Why are these data files distributed as .csv.gz files instead of .csv (comma separated values) files? Read the page https://mimic.mit.edu/docs/iv/ to understand what’s in each folder.
Briefly describe what Bash commands zcat, zless, zmore, and zgrep do.
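For orientation, here is a minimal sketch of the two non-interactive tools, zcat and zgrep, run on a tiny gzip file built in a temporary directory. The file name toy.csv.gz and its contents are made up for illustration and are not MIMIC data. zless and zmore are pagers (the compressed-file counterparts of less and more) that need an interactive terminal, so they appear only in a comment.

```shell
#!/bin/sh
# Toy data only: toy.csv.gz is a hypothetical stand-in for a MIMIC file.
tmpdir=$(mktemp -d)
printf 'subject_id,admission_type\n1,URGENT\n2,ELECTIVE\n3,URGENT\n' \
  | gzip > "$tmpdir/toy.csv.gz"

# zcat: decompress to standard output; the .gz file on disk is left as-is.
first_line=$(zcat < "$tmpdir/toy.csv.gz" | head -n 1)
echo "header: $first_line"

# zgrep: run grep on the decompressed content; -c counts matching lines.
urgent_rows=$(zgrep -c 'URGENT' "$tmpdir/toy.csv.gz")
echo "rows containing URGENT: $urgent_rows"

# zless / zmore: page through the decompressed content interactively, e.g.
#   zless "$tmpdir/toy.csv.gz"

rm -r "$tmpdir"
```

None of these commands writes a decompressed copy to disk, which is why they suit large .csv.gz files.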
(Looping in Bash) What’s the output of the following bash script?
for datafile in ~/mimic/hosp/{a,l,pa}*.gz
do
  ls -l $datafile
done
Display the number of lines in each data file using a similar loop. (Hint: combine linux commands zcat < and wc -l.)
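One possible shape for such a loop is sketched below on two toy gzip files created in a temporary directory. The file names a.csv.gz and b.csv.gz and their contents are hypothetical; with real data the glob would point at the MIMIC folder instead.

```shell
#!/bin/sh
# Toy data only: these two .gz files are hypothetical stand-ins.
tmpdir=$(mktemp -d)
printf 'id\n1\n2\n' | gzip > "$tmpdir/a.csv.gz"
printf 'id\n1\n2\n3\n4\n' | gzip > "$tmpdir/b.csv.gz"

# Stream each file through zcat and count its lines (header included).
report=$(
  for datafile in "$tmpdir"/*.gz
  do
    # tr removes the padding that some wc implementations add.
    n=$(zcat < "$datafile" | wc -l | tr -d ' ')
    echo "$(basename "$datafile"): $n"
  done
)
echo "$report"

rm -r "$tmpdir"
```

The counts include the header line, so the number of data rows is one less per file.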
Display the first few lines of admissions.csv.gz. How many rows are in this data file? How many unique patients (identified by subject_id) are in this data file? Do they match the number of patients listed in the patients.csv.gz file? (Hint: combine Linux commands zcat <, head/tail, awk, sort, uniq, wc, and so on.)
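The pipeline hinted at here might look like the following sketch. It runs on a made-up five-row file (adm.csv.gz is a hypothetical name) and assumes subject_id is the first comma-separated column; that assumption should be checked against the real header before reusing the awk field number.

```shell
#!/bin/sh
# Toy data only: adm.csv.gz is a hypothetical stand-in for admissions.csv.gz.
tmpdir=$(mktemp -d)
printf 'subject_id,hadm_id\n10,100\n10,101\n11,102\n12,103\n12,104\n' \
  | gzip > "$tmpdir/adm.csv.gz"

# Number of data rows: skip the header with tail -n +2, then count lines.
n_rows=$(zcat < "$tmpdir/adm.csv.gz" | tail -n +2 | wc -l | tr -d ' ')

# Unique ids: keep column 1, sort so duplicates are adjacent, collapse, count.
n_ids=$(zcat < "$tmpdir/adm.csv.gz" | tail -n +2 \
  | awk -F, '{print $1}' | sort | uniq | wc -l | tr -d ' ')

echo "rows: $n_rows, unique subject_id: $n_ids"

rm -r "$tmpdir"
```

uniq only collapses adjacent duplicates, which is why sort comes first.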
What are the possible values taken by each of the variable admission_type, admission_location, insurance, and ethnicity? Also report the count for each unique value of these variables. (Hint: combine Linux commands zcat, head/tail, awk, uniq -c, wc, and so on; skip the header line.)
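A frequency table of this kind can be sketched with sort and uniq -c. The example below uses a made-up file (adm.csv.gz is a hypothetical name) in which insurance is assumed to be column 2; in the real data the column position has to be read off the header first.

```shell
#!/bin/sh
# Toy data only: the column layout here is an assumption for illustration.
tmpdir=$(mktemp -d)
printf 'subject_id,insurance\n1,Medicare\n2,Other\n3,Medicare\n4,Medicaid\n5,Medicare\n6,Other\n' \
  | gzip > "$tmpdir/adm.csv.gz"

# Skip the header, keep the column of interest, then count each distinct value.
# sort groups equal values; uniq -c prefixes each with its count;
# sort -rn orders the table from most to least frequent.
counts=$(zcat < "$tmpdir/adm.csv.gz" | tail -n +2 \
  | awk -F, '{print $2}' | sort | uniq -c | sort -rn)
echo "$counts"

rm -r "$tmpdir"
```

Splitting on every comma with awk -F, is only safe when no field contains a quoted comma, which is worth verifying on the real file.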
To compress, or not to compress. That’s the question. Let’s focus on the big data file labevents.csv.gz. Compare the compressed gz file size to the uncompressed file size. Compare the run times of zcat < ~/mimic/hosp/labevents.csv.gz | wc -l versus wc -l labevents.csv. Discuss the trade-off between storage and speed for big data files. (Hint: gzip -dk < FILENAME.gz > ./FILENAME. Remember to delete the large labevents.csv file after the exercise.)
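The storage side of the comparison can be sketched on a small synthetic file, as below. The file toy.csv and its repetitive rows are invented for illustration, so the compression ratio says nothing about labevents.csv.gz itself. Timing is left out of the sketch; in Bash each line-count command can be prefixed with time to compare run times.

```shell
#!/bin/sh
# Toy data only: 2000 repetitive rows, a hypothetical stand-in for a large CSV.
tmpdir=$(mktemp -d)
i=0
while [ "$i" -lt 2000 ]
do
  echo "$i,10000032,50912,mg/dL"
  i=$((i + 1))
done > "$tmpdir/toy.csv"

# gzip -c writes the compressed stream to stdout and keeps the original file.
gzip -c "$tmpdir/toy.csv" > "$tmpdir/toy.csv.gz"

# Storage: compare sizes in bytes.
raw_bytes=$(wc -c < "$tmpdir/toy.csv" | tr -d ' ')
gz_bytes=$(wc -c < "$tmpdir/toy.csv.gz" | tr -d ' ')
echo "uncompressed: $raw_bytes bytes, compressed: $gz_bytes bytes"

# Speed: both routes must give the same line count; the compressed route
# pays extra CPU time for decompression.
lines_raw=$(wc -l < "$tmpdir/toy.csv" | tr -d ' ')
lines_gz=$(zcat < "$tmpdir/toy.csv.gz" | wc -l | tr -d ' ')
echo "lines: $lines_raw (raw) vs $lines_gz (via zcat)"

rm -r "$tmpdir"
```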
Q4. Who’s popular in Pride and Prejudice
You and your friend have just finished reading Pride and Prejudice by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from http://www.gutenberg.org/cache/epub/42671/pg42671.txt and save to your local folder.
Explain what wget -nc does. Do not put this text file pg42671.txt in Git. Complete the following loop to tabulate the number of times each of the four characters is mentioned using Linux commands.
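Answer: wget -nc downloads the file from the URL only if no file with the same name already exists in the current directory (-nc is short for --no-clobber); otherwise the download is skipped and the existing file is left untouched.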
for char in Elizabeth Jane Lydia Darcy
do
  echo $char:
  # some bash commands here
  grep -o -i $char pg42671.txt | wc -l
done
It shows that Elizabeth was the most mentioned: she was mentioned 634 times in the book, Darcy 418 times, Jane 293 times, and Lydia 171 times. The -i option of grep makes the search case-insensitive. The -o option of grep prints each match on its own line, so a line containing several matches is counted once per match. The -l option of wc prints the number of lines in its input.
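To illustrate why -o matters for counting mentions, the sketch below compares grep -o piped into wc -l with grep -c on a three-line toy text. The file toy.txt and its sentences are invented for illustration and are not taken from the novel.

```shell
#!/bin/sh
# Toy text only: toy.txt is a hypothetical stand-in for pg42671.txt.
tmpdir=$(mktemp -d)
printf 'Darcy met Elizabeth. Elizabeth smiled.\nELIZABETH left.\nJane stayed.\n' \
  > "$tmpdir/toy.txt"

# grep -o prints every match on its own line, so wc -l counts occurrences.
matches=$(grep -o -i 'Elizabeth' "$tmpdir/toy.txt" | wc -l | tr -d ' ')

# grep -c counts matching lines, so two mentions on one line count once.
lines=$(grep -c -i 'Elizabeth' "$tmpdir/toy.txt")

echo "occurrences: $matches, matching lines: $lines"

rm -r "$tmpdir"
```

The first line mentions the name twice, so the two numbers differ; counting lines alone would undercount mentions.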
What’s the difference between the following two commands?
echo 'hello, world' > test1.txt
and
echo 'hello, world' >> test2.txt
Answer: The first command (>) writes the text to test1.txt, creating the file if it does not exist and overwriting its contents if it does. The second command (>>) appends the text to the end of test2.txt, creating the file if it does not exist.
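The difference becomes visible when each command is run twice. The sketch below does that inside a temporary directory, so no files in the working directory are touched.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)

# > truncates the file on every run, so only the last write survives.
echo 'hello, world' > "$tmpdir/test1.txt"
echo 'hello, world' > "$tmpdir/test1.txt"

# >> adds to the end of the file, so both writes are kept.
echo 'hello, world' >> "$tmpdir/test2.txt"
echo 'hello, world' >> "$tmpdir/test2.txt"

n1=$(wc -l < "$tmpdir/test1.txt" | tr -d ' ')
n2=$(wc -l < "$tmpdir/test2.txt" | tr -d ' ')
echo "test1.txt has $n1 line(s); test2.txt has $n2 line(s)"

rm -r "$tmpdir"
```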
Using your favorite text editor (e.g., vi), type the following and save the file as middle.sh:
#!/bin/sh
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
Use chmod to make the file executable by the owner, and run
./middle.sh pg42671.txt 20 5
Explain the output. Explain the meaning of "$1", "$2", and "$3" in this shell script. Why do we need the first line of the shell script?
Answer: The output is the 5 lines from line 16 to line 20 of the file pg42671.txt. "$1" is the first argument of the shell script, the file name (pg42671.txt). "$2" is the second argument, the end line (20). "$3" is the third argument, the number of lines (5). The script therefore selects lines from the middle of the file: head first selects the first 20 lines of pg42671.txt and passes them through the pipe to tail, which then keeps the last 5 of those lines.
The #! symbol is called a shebang or hashbang. It is a special character sequence that appears at the beginning of a script or an executable file in Unix-like operating systems. The shebang is followed by the path to the interpreter that should be used to execute the script. Therefore, the first line of the shell script tells the system to use the Bourne shell (/bin/sh) as the interpreter for executing the script.
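The head and tail combination is easy to check on a file whose contents equal its line numbers. The sketch below wraps the same pipeline in a shell function called middle (a name chosen here for illustration) and applies it to a 30-line file generated with seq.

```shell
#!/bin/sh
# Same pipeline as middle.sh, wrapped in a function for the demonstration.
middle() {
  # $1 = file name, $2 = end line, $3 = number of lines to keep
  head -n "$2" "$1" | tail -n "$3"
}

tmpdir=$(mktemp -d)
seq 1 30 > "$tmpdir/numbers.txt"   # line k contains the number k

# Keep the 5 lines ending at line 20; paste joins them onto one line.
picked=$(middle "$tmpdir/numbers.txt" 20 5 | paste -s -d ' ' -)
echo "$picked"

rm -r "$tmpdir"
```

Because each line holds its own line number, the output directly shows which lines were selected.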
Q5. More fun with Linux
Try the following commands in Bash and interpret the results: cal, cal 2024, cal 9 1752 (anything unusual?), date, hostname, arch, uname -a, uptime, who am i, who, w, id, last | head, echo {con,pre}{sent,fer}{s,ed}, time sleep 5, history | tail.
Answer: Here is the output of the commands:
cal
    January 2024
Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31
cal: display the calendar of the current month.
cal 2024
                              2024
      January               February               March
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6               1  2  3                  1  2
 7  8  9 10 11 12 13   4  5  6  7  8  9 10   3  4  5  6  7  8  9
14 15 16 17 18 19 20  11 12 13 14 15 16 17  10 11 12 13 14 15 16
21 22 23 24 25 26 27  18 19 20 21 22 23 24  17 18 19 20 21 22 23
28 29 30 31           25 26 27 28 29        24 25 26 27 28 29 30
                                            31
       April                  May                   June
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6            1  2  3  4                     1
 7  8  9 10 11 12 13   5  6  7  8  9 10 11   2  3  4  5  6  7  8
14 15 16 17 18 19 20  12 13 14 15 16 17 18   9 10 11 12 13 14 15
21 22 23 24 25 26 27  19 20 21 22 23 24 25  16 17 18 19 20 21 22
28 29 30              26 27 28 29 30 31     23 24 25 26 27 28 29
                                            30
        July                 August              September
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6               1  2  3   1  2  3  4  5  6  7
 7  8  9 10 11 12 13   4  5  6  7  8  9 10   8  9 10 11 12 13 14
14 15 16 17 18 19 20  11 12 13 14 15 16 17  15 16 17 18 19 20 21
21 22 23 24 25 26 27  18 19 20 21 22 23 24  22 23 24 25 26 27 28
28 29 30 31           25 26 27 28 29 30 31  29 30
      October               November              December
Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa  Su Mo Tu We Th Fr Sa
       1  2  3  4  5                  1  2   1  2  3  4  5  6  7
 6  7  8  9 10 11 12   3  4  5  6  7  8  9   8  9 10 11 12 13 14
13 14 15 16 17 18 19  10 11 12 13 14 15 16  15 16 17 18 19 20 21
20 21 22 23 24 25 26  17 18 19 20 21 22 23  22 23 24 25 26 27 28
27 28 29 30 31        24 25 26 27 28 29 30  29 30 31
cal 2024: display the calendar of the year 2024.
cal 9 1752
   September 1752
Su Mo Tu We Th Fr Sa
       1  2 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
cal 9 1752: display the calendar for September 1752. The month looks unusual because the British Empire switched from the Julian calendar to the Gregorian calendar in September 1752. By then the Julian calendar was 11 days behind the Gregorian calendar, so the 11 days from September 3 to September 13 were skipped: September 2 was followed directly by September 14.
date
Thu Jan 25 09:58:22 PST 2024
date: display the current date and time.
hostname
zhangjiyindeAir.lan
hostname: display the name of the host.
arch
arm64
arch: display the machine hardware name.
uname -a
Darwin zhangjiyindeAir.lan 21.6.0 Darwin Kernel Version 21.6.0: Thu Mar 9 20:10:19 PST 2023; root:xnu-8020.240.18.700.8~1/RELEASE_ARM64_T8101 arm64
uptime: display the current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.
who am i
zhangjiy tty?? Jan 25 09:58
who am i: display the current user.
who
zhangjiyin console Jan 23 01:06
zhangjiyin ttys000 Jan 24 00:52
who: display the users who are currently logged in.
# w
w: display the users who are currently logged in and what they are doing.
Open the project by clicking rep-res-3rd-edition.Rproj and compile the book by clicking Build Book in the Build panel of RStudio. (Hint: I was able to build git_book and epub_book but not pdf_book.)
The point of this exercise is (1) to get the book for free and (2) to see an example of how a complicated project such as a book can be organized in a reproducible way.
For grading purposes, include a screenshot of Section 4.1.5 of the book here.
Answer:
I was also able to build git_book and epub_book but not pdf_book. Here is the screenshot of Section 4.1.5 of the git_book. Here is the screenshot of Section 4.1.5 of the epub_book.